
convex combination between the full-precision $f_n$ and the quantized $\hat{f}_n$ as follows:

$$\tilde{f}_n = \lambda f_n + (1 - \lambda)\,\hat{f}_n. \tag{5.11}$$

The hyperparameter λ controls the strength of teacher forcing. λ = 1 gives the full correction of the reconstruction error but introduces forward inconsistency, i.e., the connection between the current module and the previous quantized modules is broken. Conversely, λ = 0 removes the forward inconsistency but suffers from the propagated reconstruction error. To achieve a good trade-off between reconstruction error reduction and forward inconsistency elimination, a linear decay strategy for λ is proposed:

$$\lambda_t = \max\!\left(1 - \frac{t}{T_0},\ 0\right), \tag{5.12}$$

where $T_0$ is the preset maximum number of decay steps. In the beginning, a large λ is desired since each module has barely been optimized. Later, a small λ is preferred to transition to normal training so that the forward inconsistency can be bridged. The remaining $T - T_0$ steps stick to normal training so that each quantized module adapts to its own predecessors.
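As a rough illustration, the following PyTorch-style sketch implements the linear decay of Eq. (5.12) and the convex combination of Eq. (5.11). The function names, tensor shapes, and the toy loop are illustrative assumptions, not the actual MREM implementation; in practice the mixed hidden states would feed the next quantized module in place of the purely quantized output.

```python
import torch


def teacher_forcing_lambda(t: int, T0: int) -> float:
    # Linearly decayed teacher-forcing strength, Eq. (5.12): lambda_t = max(1 - t/T0, 0).
    return max(1.0 - t / T0, 0.0)


def annealed_teacher_forcing(fp_out: torch.Tensor,
                             quant_out: torch.Tensor,
                             lam: float) -> torch.Tensor:
    # Convex combination of full-precision and quantized outputs, Eq. (5.11):
    # f~_n = lambda * f_n + (1 - lambda) * f^_n.
    return lam * fp_out + (1.0 - lam) * quant_out


# Toy usage: random hidden states stand in for the outputs of module n.
if __name__ == "__main__":
    T, T0 = 2000, 1000                       # total steps and preset decay horizon (assumed values)
    fp_out = torch.randn(8, 128, 768)        # full-precision output f_n
    quant_out = fp_out + 0.05 * torch.randn_like(fp_out)  # quantized output with reconstruction error
    for t in (0, 500, 1000, 1500):
        lam = teacher_forcing_lambda(t, T0)
        mixed = annealed_teacher_forcing(fp_out, quant_out, lam)
        print(f"step {t}: lambda = {lam:.2f}")
```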

The comparison between the proposed method and other existing state-of-the-art BERT quantization methods is presented in Table 5.4. From Table 5.4, both the proposed MREM-S and MREM-P outperform existing PTQ approaches in most cases, and even achieve results close to those of QAT approaches. For example, the “W4-E4-A8” quantized MREM-S and MREM-P attain accuracies of 83.5% and 83.4% on MNLI-m, respectively, which are on par with the “W2/4-E8-A8” quantized Q-BERT. In terms of the “W2-E2-A8” quantized models, our MREM-S and MREM-P surpass GOBO by 11.7% and 11.3% on MNLI-m, respectively.

In summary, this paper’s contributions are as follows: (1) module-wise reconstruction error minimization (MREM), a fast, memory-saving, and data-efficient approach that improves post-training quantization for language models; (2) a new model-parallel strategy based on MREM that accelerates post-training quantization with a theoretical speed-up for distributed training; and (3) annealed teacher forcing, which alleviates the propagation of reconstruction error and boosts performance.

TABLE 5.4
Results on the GLUE development set. “MREM-S” denotes sequential optimization.

| Quantization | #Bits (W-E-A) | Size (MB) | PTQ | MNLI-m | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| - | full-prec. | 418 | - | 84.9 | 91.4 | 92.1 | 93.2 | 59.7 | 90.1 | 86.3 | 72.2 | 83.9 |
| Q-BERT | 2-8-8 | 43 | ✗ | 76.6 | - | - | 84.6 | - | - | - | - | - |
| Q-BERT | 2/4-8-8 | 53 | ✗ | 83.5 | - | - | 92.6 | - | - | - | - | - |
| Quant-Noise | PQ | 38 | ✗ | 83.6 | - | - | - | - | - | - | - | - |
| TernaryBERT | 2-2-8 | 28 | ✗ | 83.3 | 90.1 | 91.1 | 92.8 | 55.7 | 87.9 | 87.5 | 72.9 | 82.7 |
| GOBO | 3-4-32 | 43 | ✓ | 83.7 | - | - | - | - | 88.3 | - | - | - |
| GOBO | 2-2-32 | 28 | ✓ | 71.0 | - | - | - | - | 82.7 | - | - | - |
| MREM-S | 4-4-8 | 50 | ✓ | 83.5 | 90.2 | 91.2 | 91.4 | 55.1 | 89.1 | 84.8 | 71.8 | 82.4 |
| MREM-S | 2-2-8 | 28 | ✓ | 82.7 | 89.6 | 90.3 | 91.2 | 52.3 | 88.7 | 86.0 | 71.1 | 81.5 |
| MREM-P | 4-4-8 | 50 | ✓ | 83.4 | 90.2 | 91.0 | 91.5 | 54.7 | 89.1 | 86.3 | 71.1 | 82.2 |
| MREM-P | 2-2-8 | 28 | ✓ | 82.3 | 89.4 | 90.3 | 91.3 | 52.9 | 88.3 | 85.8 | 72.9 | 81.6 |

Note: “MREM-P” denotes parallel optimization. “Size” refers to model storage in MB. “PTQ” indicates whether the method belongs to post-training quantization. “Avg.” denotes the average results over all tasks.